Add backend-aware test parameter scaling for pure Python CI by igerber · Pull Request #129 · igerber/diff-diff

igerber · 2026-02-01T23:32:49Z

Summary

Add CIParams class and ci_params session-scoped pytest fixture in conftest.py that detects pure Python mode and scales down bootstrap iterations and TROP grid sizes
Update ~89 test functions across 7 test files to use ci_params.bootstrap(n) for bootstrap iteration counts and ci_params.grid(values) for TROP lambda grids
Make 11 hardcoded assertion values dynamic to match scaled parameters
Leave API-only tests (no .fit() calls) and warning-threshold tests unchanged

Methodology references (required if estimator / math changes)

N/A - no methodology changes. Only test parameter scaling; same code paths exercised with fewer iterations in pure Python mode.

Validation

Tests added/updated: tests/conftest.py, tests/test_trop.py, tests/test_staggered.py, tests/test_wild_bootstrap.py, tests/test_sun_abraham.py, tests/test_estimators.py, tests/test_methodology_callaway.py, tests/test_methodology_did.py
All 387+ tests across modified files pass with Rust backend
With Rust backend, all values pass through unchanged (zero behavior change)

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Introduce ci_params session-scoped fixture that scales bootstrap iterations and TROP grid sizes when running without Rust backend, reducing CI time while preserving code path coverage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-02-01T23:37:31Z

Overall assessment: ⚠️ Needs changes

Executive summary:

P1: The TROP paper-conformance test now accepts as few as 11 successful bootstrap draws even when n_boot=30, weakening the regression guard for stratified sampling.
P2: Bootstrap scaling in pure Python mode reduces methodology convergence tests to ~35 iterations, which can make SE-convergence assertions noisy and less meaningful.
No estimator or inference logic changes detected in this diff; methodology registry requirements appear unchanged.
AGENTS.md not found in repo; proceeded without additional agent instructions.

Methodology

Severity: P1 | Impact: The paper-conformance test for stratified bootstrap now allows as few as 11 successful draws even when n_boot=30, which weakens detection of regressions in Algorithm 3’s stratified sampling behavior. | Concrete fix: Make the threshold proportional to n_boot (e.g., >= max(11, int(0.67 * n_boot)) or >= n_boot - 5) so Rust-backend runs keep the stronger expectation while pure Python scaling remains feasible. tests/test_trop.py:L1615-L1669
Severity: P3 | Impact: No estimator/assumption/inference code paths were changed in this diff; registry requirements are unaffected. | Concrete fix: None. docs/methodology/REGISTRY.md:L1-L40

Code Quality

Severity: P3 | Impact: No code-quality issues identified in the changed test/fixture code beyond findings above. | Concrete fix: None.

Performance

Severity: P3 | Impact: Changes are test-only performance optimizations; no production performance impact observed. | Concrete fix: None.

Maintainability

Severity: P3 | Impact: No maintainability issues introduced beyond findings above. | Concrete fix: None.

Tech Debt

Severity: P3 | Impact: No new tech-debt concerns introduced in this diff. | Concrete fix: None.

Security

Severity: P3 | Impact: No security-sensitive changes or secrets observed in the diff. | Concrete fix: None.

Documentation/Tests

Severity: P2 | Impact: Methodology convergence tests now use ci_params.bootstrap(499), which scales to ~35 iterations in pure Python mode; this can make the 20–25% SE-convergence assertions noisy and reduce methodological verification strength. | Concrete fix: Exempt these convergence tests from scaling (keep high n_boot), or conditionally relax/skip when n_boot falls below a minimum threshold. tests/conftest.py:L99-L113, tests/test_methodology_callaway.py:L803-L839

…convergence tests The sqrt(n)*1.6 scaling reduced n=499 to 35 iterations — too aggressive for SE convergence tests. Add min_n keyword to CIParams.bootstrap() so these tests can request a higher floor. Also make TROP stratified bootstrap threshold proportional to n_boot (restoring Rust-mode behavior from main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-07T21:02:40Z

/ai-review

github-actions · 2026-02-07T21:06:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 647841118c89a4dc50179c53bd8e73e0b203c6d8

Overall assessment: ⚠️ Needs changes

Executive summary

No estimator/math/identification logic changes in the diff; methodology registry alignment remains unchanged.
ci_params fixture centralizes backend-aware scaling for bootstrap and TROP grids, improving CI performance.
Potential pure‑Python CI failure: stratified bootstrap test now requires 100% success when scaled to 11 draws.

Methodology

No findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P1 — Impact: In pure‑Python mode ci_params.bootstrap(30) scales to 11, and the new threshold min_successes = max(11, int(0.67 * n_boot)) requires all 11 bootstrap draws to succeed, despite the test’s own comment (“Most iterations succeed”). If any draw fails (e.g., solver non‑convergence), this becomes a deterministic CI failure in Python backend. — Concrete fix: Remove the hard 11‑draw floor or cap by a fraction only (e.g., min_successes = max(3, math.ceil(0.67 * n_boot))), or set a higher min_n for this test so the 67% threshold remains meaningful. (tests/test_trop.py:L1650-L1671)

… Python CI In pure Python mode (n_boot=11), max(11, int(0.67*11)) required 100% bootstrap success rate. Lowering floor to 5 gives max(5,7)=7 (~64%), consistent with Rust mode's 67% threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-07T21:28:40Z

/ai-review

github-actions · 2026-02-07T21:34:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: cf37b5ae5ad7e45ab14548c2c3a15b8983da5dca

Overall assessment: ✅ Looks good

Executive summary

No estimator, weighting, variance/SE, or identification logic changes in this diff; Methodology Registry impact is nil.
CIParams introduces backend-aware scaling; most bootstrap/grid usage now adapts to pure-Python CI.
Minor follow-ups: one SDID test still bypasses scaling; CIParams.grid lacks direct unit coverage; one docstring still claims fixed 499 iterations.

Methodology

No findings (tests-only changes; no estimator/math changes to cross-check against docs/methodology/REGISTRY.md).

Code Quality

No findings.

Performance

P3 — Impact: one .fit() test still uses a fixed n_bootstrap=50, so pure-Python CI won’t get the intended scaling speedup. Fix: inject ci_params and use ci_params.bootstrap(50) unless this test must lock to 50 for a warning threshold. Location: tests/test_estimators.py:L3363-L3371.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

P3 — Impact: CIParams.grid behavior is untested; future refactors could silently change grid scaling. Fix: add unit tests covering len(values) <= 3 passthrough and len(values) > 3 first/middle/last selection. Location: tests/conftest.py:L118-L123.
P3 — Impact: docstring claims a fixed “499 bootstrap iterations,” but pure-Python mode now scales down (e.g., to 199), which can mislead reviewers. Fix: update text to “up to 499” or mention backend-aware scaling. Location: tests/test_methodology_callaway.py:L803-L815.

igerber merged commit 4d128ee into main Feb 8, 2026
7 checks passed

igerber deleted the review-ci-tests branch February 8, 2026 00:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add backend-aware test parameter scaling for pure Python CI#129

Add backend-aware test parameter scaling for pure Python CI#129
igerber merged 3 commits intomainfrom
review-ci-tests

igerber commented Feb 1, 2026

Uh oh!

github-actions bot commented Feb 1, 2026

Uh oh!

igerber commented Feb 7, 2026

Uh oh!

github-actions bot commented Feb 7, 2026

Uh oh!

igerber commented Feb 7, 2026

Uh oh!

github-actions bot commented Feb 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Feb 1, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Feb 1, 2026

Uh oh!

igerber commented Feb 7, 2026

Uh oh!

github-actions bot commented Feb 7, 2026

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

igerber commented Feb 7, 2026

Uh oh!

github-actions bot commented Feb 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant